Prithviraj Kadiyala

1 My Video

2 Introduction

Salary - A fixed regular payment, typically paid on a monthly or biweekly basis but often expressed as an annual sum, made by an employer to an employee, especially a professional or white-collar worker.

The following is taken from Forbes articles.

There has been a lot of buzz in the software industry about unequal pay: employees who have worked for a single company for a long time are compared with recent graduates who start on very good salaries. Many articles have also been written about loyal employees being paid less than employees who change jobs every 3-4 years and get salary hikes of almost 50%.

Those very new to the tech industry, with less than a year of experience, can expect to earn $50,321 (a year-over-year increase of 9.8 percent). After a year or two, that average salary jumps to $62,517 (a whopping 24.3 percent increase, year-over-year).

Spend three to five years, and the average leaps yet again, to $68,040 (a 6.3 percent increase). Between six and ten years in the industry, salaries hit $83,143 (a rise of 6.8 percent).

Breaking the ten-year mark translates into big bucks. Those with 11 to 15 years of experience could expect to pull down $96,792 (a 3.8 percent increase over last year), while those with more than 15 years average $115,399 (a 6 percent increase).

The graph below shows the salary hike employees get when they change companies. [Figure: Salary]

The data was collected here:

2.1 What are the variables?

dataset = read.csv("Emp_Salary.csv",header=TRUE,sep=",")
head(dataset)
names(dataset)
## [1] "Employee" "EducLev"  "JobGrade" "YrsExper" "Age"      "Gender"  
## [7] "YrsPrior" "PCJob"    "Salary"

2.1.1 Plot data

library(s20x)
## Warning: package 's20x' was built under R version 3.4.4
pairs20x(dataset)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
g = ggplot(dataset, aes(x = YrsExper, y = Salary, color = EducLev)) + geom_point()
g = g + xlab("Years of Experience") 
g = g + geom_smooth(method = "loess")
g

2.2 How were the data collected?

2.3 What is the story behind the data?

2.4 Why was it gathered?

2.5 What is your interest in the data?

I expect to work in the software industry in the future, so it would be really useful to analyze how the IT industry works beforehand. Knowing what to do, and when, in a given situation would put me in a good position to enter the market and negotiate a higher base salary package.

2.6 What problem do you wish to solve?

3 Theory needed to carry out SLR

3.1 Main result 1

3.2 Main result 2

3.3 Main result 3 etc

4 Validity with mathematical expressions

The following function was taken from https://rpubs.com/therimalaya/43190

4.1 Checks on validity

4.1.1 Straight trend line

4.1.1.1 Use trendscatter

trendscatter(YrsExper~Salary,f=0.5,data=dataset)

4.1.1.2 Shapiro-Wilk

dataset.lm=lm(YrsExper~Salary,data=dataset)
summary(dataset.lm)
## 
## Call:
## lm(formula = YrsExper ~ Salary, data = dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.232 -4.333 -1.343  3.625 27.119 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.584e+00  1.414e+00   -3.95 0.000107 ***
## Salary       3.822e-04  3.409e-05   11.21  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.52 on 206 degrees of freedom
## Multiple R-squared:  0.379,  Adjusted R-squared:  0.376 
## F-statistic: 125.7 on 1 and 206 DF,  p-value: < 2.2e-16
normcheck(dataset.lm,shapiro.wilk = TRUE)

4.1.2 Errors distributed Normally

The p-value for the Shapiro-Wilk test is reported as 0. The null hypothesis in this case is that the errors are normally distributed:

\[\epsilon_i \sim N(0,\sigma^2)\]

The Shapiro-Wilk test gives strong evidence against the null hypothesis (the reported p-value of 0 is below the 0.05 significance level), so we reject it and conclude that the errors are not normally distributed.
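The decision rule above can also be applied directly with base R's shapiro.test(), without the s20x plot. The sketch below is only an illustration: it uses simulated data with deliberately skewed errors as a stand-in for Emp_Salary.csv, and the names sim, sim.lm, sw and reject_normality are hypothetical.

```r
# Hypothetical simulated data standing in for Emp_Salary.csv (not the real data).
set.seed(1)
n <- 208
salary <- runif(n, 25000, 95000)
# Right-skewed (exponential) errors, centred at zero, so normality fails by design.
yrs <- -5.5 + 3.8e-4 * salary + (rexp(n, rate = 1 / 5) - 5)
sim <- data.frame(Salary = salary, YrsExper = yrs)

# Same model form as in the report: YrsExper regressed on Salary.
sim.lm <- lm(YrsExper ~ Salary, data = sim)

# Shapiro-Wilk test on the residuals; H0: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(sim.lm))
print(sw$p.value)

# Reject H0 at the 5% level when the p-value is below 0.05.
reject_normality <- sw$p.value < 0.05
print(reject_normality)
```

On the real data, replacing sim.lm with dataset.lm should give the same p-value that normcheck() prints on its plot.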

4.1.3 Constant variance

4.1.3.1 Residual vs fitted values

Yrs.res=residuals(dataset.lm)
Yrs.fit=fitted(dataset.lm)
plot(Yrs.fit,Yrs.res, xlab="Fitted", ylab="Residuals", main="Fitted vs Residuals")

4.1.3.2 trendscatter on Residual Vs Fitted

trendscatter(Yrs.fit,Yrs.res, xlab="Fitted", ylab="Residuals")

4.1.4 Zero mean value of \(\epsilon\)

plot(dataset.lm, which =1)

4.1.5 Independence of data

5 Model selection if you compared models

5.1 Use adjusted \(R^2\)

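\[R_{adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}\]

where \(n\) is the number of observations and \(k\) is the number of predictors in the model.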
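As a quick arithmetic check, the adjusted R-squared printed by summary(dataset.lm) can be recomputed from the multiple R-squared, the sample size and the number of predictors, using the standard definition. The values are read from the summary output earlier in this report; the variable names r2, n, k and r2_adj are illustrative.

```r
# Values read from the summary() output shown earlier in this report.
r2 <- 0.379   # Multiple R-squared
n  <- 208     # 206 residual degrees of freedom + 2 estimated coefficients
k  <- 1       # one predictor (Salary)

# Adjusted R-squared penalises R-squared for the number of predictors.
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 3))
```

Rounded to three decimals this gives 0.376, which agrees with the adjusted R-squared in the summary output. With a single predictor the penalty is small, so the adjusted value is useful mainly when comparing this model against models with more terms.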

6 Analysis of the data

6.1 Make sure you include many great plots

6.2 Add the trend to the data

6.3 Summary lm object

6.3.1 Interpretation of all tests

6.3.2 Interpretation of multiple R squared

6.3.3 Interpretation of all point estimates

6.4 Calculate cis for \(\beta\) parameter estimates

6.4.1 Use of predict()

6.4.2 Use of ciReg()

6.4.3 Check on outliers using cooks plots

Remember to interpret this plot and all other plots

7 Conclusion

7.1 Answer your research question

7.2 Suggest ways to improve model or experiment